-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JIT: Fix gtNodeHasSideEffects
checking call arguments
#106185
Conversation
`gtNodeHasSideEffects` is meant to check if a node has side effects when you exclude its children. However, it was checking arguments of calls which is more conservative than expected. The actual reason we were doing that seems to be `gtTreeHasSideEffects` that sometimes wants to ignore `GTF_CALL` on pure helper calls. It was relying on this check in `gtNodeHasSideEffect`; instead move it to `gtTreeHasSideEffects` where it belongs. This is an alternative fix for dotnet#106129; there we leave a `COMMA(CORINFO_HELP_RUNTIMELOOKUP_METHOD, ...)` around because extracting side effects from op1 does not end up getting rid of the call. Fix dotnet#106129
@EgorBot -amd -arm64 using System.Net;
using BenchmarkDotNet.Attributes;
namespace genericlist
{
public sealed class BinaryTrie
{
[System.Runtime.CompilerServices.InlineArray(2)]
public struct TwoBranches<TNodeLeaf>
{
private TNodeLeaf? _ref0;
}
public class Node<TNodeLeaf>
{
public TwoBranches<Node<TNodeLeaf>> Branches;
public Node<TNodeLeaf> n0;
public Node<TNodeLeaf> n1;
public bool IsLeaf;
public TNodeLeaf? NodeLeaf;
}
public static TLeaf? LookupInlineArray<TLeaf>(Node<TLeaf> root)
{
Node<TLeaf> currentNode = root;
TLeaf? result = default;
for (int i = 0; i < 8; i++)
{
int bit = i % 2;
Node<TLeaf>? branch = currentNode.Branches[bit];
if (branch is not null)
{
return result;
}
}
return result;
}
public static TLeaf? LookupTernary<TLeaf>(Node<TLeaf> root)
{
Node<TLeaf> currentNode = root;
TLeaf? result = default;
for (int i = 0; i < 8; i++)
{
int bit = i % 2;
Node<TLeaf>? branch = bit == 0 ? currentNode.n0 : currentNode.n1;
if (branch is not null)
{
return result;
}
}
return result;
}
}
[DisassemblyDiagnoser]
public class Benchmarks
{
BinaryTrie.Node<bool> _bool;
BinaryTrie.Node<IPAddress> _ipaddress;
[GlobalSetup]
public void Setup()
{
_bool = new BinaryTrie.Node<bool>();
_ipaddress = new BinaryTrie.Node<IPAddress>();
}
[Benchmark]
public bool LookupBoolInlineArray()
{
return BinaryTrie.LookupInlineArray(_bool);
}
[Benchmark]
public bool LookupBoolTernary()
{
return BinaryTrie.LookupTernary(_bool);
}
[Benchmark(Baseline = true)]
public IPAddress LookupIPAddressInlineArray()
{
return BinaryTrie.LookupInlineArray(_ipaddress);
}
[Benchmark]
public IPAddress LookupIPAddressTernary()
{
return BinaryTrie.LookupTernary(_ipaddress);
}
}
} |
❌ Benchmark failed on Arm64
|
Benchmark results on Amd
|
Looking at some regressions.. `Microsoft.CodeAnalysis.CSharp.OverloadResolutionResult`1[System.__Canon]:get_HasAnyApplicableMember():ubyte:this (Tier1)` ; 2 inlinees with PGO data; 9 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
-; V00 this [V00,T01] ( 3, 3 ) ref -> rbx this class-hnd single-def <Microsoft.CodeAnalysis.CSharp.OverloadResolutionResult`1[System.__Canon]>
+; V00 this [V00,T03] ( 3, 3 ) ref -> rbx this class-hnd single-def <Microsoft.CodeAnalysis.CSharp.OverloadResolutionResult`1[System.__Canon]>
;* V01 loc0 [V01 ] ( 0, 0 ) struct (16) zero-ref ld-addr-op <Microsoft.CodeAnalysis.PooledObjects.ArrayBuilder`1+Enumerator[Microsoft.CodeAnalysis.CSharp.MemberResolutionResult`1[System.__Canon]]>
;* V02 loc1 [V02 ] ( 0, 0 ) struct (128) zero-ref do-not-enreg[S] ld-addr-op <Microsoft.CodeAnalysis.CSharp.MemberResolutionResult`1[System.__Canon]>
;* V03 loc2 [V03 ] ( 0, 0 ) struct (104) zero-ref do-not-enreg[SF] ld-addr-op <Microsoft.CodeAnalysis.CSharp.MemberAnalysisResult>
@@ -17,101 +17,116 @@
;* V05 tmp1 [V05 ] ( 0, 0 ) long -> zero-ref "spilling helperCall"
;* V06 tmp2 [V06 ] ( 0, 0 ) long -> zero-ref "spilling helperCall"
;* V07 tmp3 [V07 ] ( 0, 0 ) long -> zero-ref "spilling helperCall"
-; V08 tmp4 [V08,T02] ( 3, 6 ) ref -> rax class-hnd exact single-def "Inlining Arg" <Microsoft.CodeAnalysis.PooledObjects.ArrayBuilder`1[Microsoft.CodeAnalysis.CSharp.MemberResolutionResult`1[System.__Canon]]>
+; V08 tmp4 [V08,T01] ( 3, 6 ) ref -> rax class-hnd exact single-def "Inlining Arg" <Microsoft.CodeAnalysis.PooledObjects.ArrayBuilder`1[Microsoft.CodeAnalysis.CSharp.MemberResolutionResult`1[System.__Canon]]>
;* V09 tmp5 [V09 ] ( 0, 0 ) struct (16) zero-ref ld-addr-op "NewObj constructor temp" <Microsoft.CodeAnalysis.PooledObjects.ArrayBuilder`1+Enumerator[Microsoft.CodeAnalysis.CSharp.MemberResolutionResult`1[System.__Canon]]>
;* V10 tmp6 [V10 ] ( 0, 0 ) long -> zero-ref "spilling helperCall"
;* V11 tmp7 [V11 ] ( 0, 0 ) ref -> zero-ref class-hnd exact "Inlining Arg" <Microsoft.CodeAnalysis.PooledObjects.ArrayBuilder`1[Microsoft.CodeAnalysis.CSharp.MemberResolutionResult`1[System.__Canon]]>
;* V12 tmp8 [V12 ] ( 0, 0 ) int -> zero-ref "Inlining Arg"
-; V13 tmp9 [V13,T07] ( 2, 4 ) ref -> r8 class-hnd exact "Inlining Arg" <System.Collections.Immutable.ImmutableArray`1+Builder[Microsoft.CodeAnalysis.CSharp.MemberResolutionResult`1[System.__Canon]]>
+; V13 tmp9 [V13,T05] ( 3, 4 ) ref -> r10 class-hnd exact "Inlining Arg" <System.Collections.Immutable.ImmutableArray`1+Builder[Microsoft.CodeAnalysis.CSharp.MemberResolutionResult`1[System.__Canon]]>
;* V14 tmp10 [V14 ] ( 0, 0 ) ubyte -> zero-ref "Inline return value spill temp"
;* V15 tmp11 [V15 ] ( 0, 0 ) ubyte -> zero-ref "Inline stloc first use temp"
;* V16 tmp12 [V16 ] ( 0, 0 ) int -> zero-ref "impAppendStmt"
;* V17 tmp13 [V17 ] ( 0, 0 ) ref -> zero-ref class-hnd exact "Inlining Arg" <Microsoft.CodeAnalysis.PooledObjects.ArrayBuilder`1[Microsoft.CodeAnalysis.CSharp.MemberResolutionResult`1[System.__Canon]]>
;* V18 tmp14 [V18 ] ( 0, 0 ) ref -> zero-ref class-hnd exact "Inlining Arg" <System.Collections.Immutable.ImmutableArray`1+Builder[Microsoft.CodeAnalysis.CSharp.MemberResolutionResult`1[System.__Canon]]>
-; V19 tmp15 [V19,T11] ( 2, 2 ) ref -> rax single-def "field V01._builder (fldOffset=0x0)" P-INDEP
-;* V20 tmp16 [V20,T14] ( 0, 0 ) int -> zero-ref "field V01._index (fldOffset=0x8)" P-INDEP
-; V21 tmp17 [V21,T12] ( 2, 2 ) ref -> rax single-def "field V09._builder (fldOffset=0x0)" P-INDEP
-;* V22 tmp18 [V22,T15] ( 0, 0 ) int -> zero-ref single-def "field V09._index (fldOffset=0x8)" P-INDEP
+; V19 tmp15 [V19,T09] ( 2, 2 ) ref -> rax single-def "field V01._builder (fldOffset=0x0)" P-INDEP
+;* V20 tmp16 [V20,T11] ( 0, 0 ) int -> zero-ref "field V01._index (fldOffset=0x8)" P-INDEP
+; V21 tmp17 [V21,T10] ( 2, 2 ) ref -> rax single-def "field V09._builder (fldOffset=0x0)" P-INDEP
+;* V22 tmp18 [V22,T12] ( 0, 0 ) int -> zero-ref single-def "field V09._index (fldOffset=0x8)" P-INDEP
;* V23 tmp19 [V23 ] ( 0, 0 ) ubyte -> zero-ref "V03.[004..005)"
-; V24 tmp20 [V24,T10] ( 3, 2.00) ubyte -> r8 "V02.[028..029)"
-; V25 tmp21 [V25,T08] ( 2, 4 ) byref -> r8 "Spilling address for field-by-field copy"
-; V26 tmp22 [V26,T03] ( 3, 6 ) ref -> r8 "arr expr"
-;* V27 cse0 [V27,T16] ( 0, 0 ) long -> zero-ref hoist "CSE #03: aggressive"
-; V28 cse1 [V28,T13] ( 2, 2 ) int -> rcx hoist "CSE #04: aggressive"
-; V29 cse2 [V29,T09] ( 3, 3 ) ref -> rax hoist "CSE #01: aggressive"
-; V30 rat0 [V30,T00] ( 6, 6 ) long -> rdx "Widened IV V20"
-;* V31 rat1 [V31,T05] ( 0, 0 ) long -> zero-ref "Spilling to split statement for tree"
-;* V32 rat2 [V32,T06] ( 0, 0 ) long -> zero-ref "runtime lookup"
-;* V33 rat3 [V33,T04] ( 0, 0 ) long -> zero-ref "fgMakeTemp is creating a new local variable"
+; V24 tmp20 [V24,T08] ( 3, 2.00) ubyte -> rax "V02.[028..029)"
+; V25 tmp21 [V25,T06] ( 2, 4 ) byref -> rax "Spilling address for field-by-field copy"
+; V26 tmp22 [V26,T02] ( 3, 6 ) ref -> rax "arr expr"
+; V27 cse0 [V27,T04] ( 4, 4 ) int -> r8 "CSE #02: aggressive"
+; V28 cse1 [V28,T07] ( 3, 3 ) ref -> rdx "CSE #01: aggressive"
+; V29 rat0 [V29,T00] ( 7, 7 ) long -> rcx "Widened IV V20"
;
-; Lcl frame size = 32
+; Lcl frame size = 48
G_M864_IG01: ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, nogc <-- Prolog IG
push rbx
- sub rsp, 32
+ sub rsp, 48
+ mov qword ptr [rsp+0x28], rcx
mov rbx, rcx
; gcrRegs +[rbx]
- ;; size=8 bbWeight=1 PerfScore 1.50
-G_M864_IG02: ; bbWeight=1, gcrefRegs=0008 {rbx}, byrefRegs=0000 {}, byref
+ ;; size=13 bbWeight=1 PerfScore 2.50
+G_M864_IG02: ; bbWeight=1, gcrefRegs=0008 {rbx}, byrefRegs=0000 {}, byref, isz
mov rax, gword ptr [rbx+0x08]
; gcrRegs +[rax]
cmp byte ptr [rax], al
- mov rax, gword ptr [rax+0x08]
- mov ecx, dword ptr [rax+0x10]
- mov edx, 0xD1FFAB1E
- ;; size=18 bbWeight=1 PerfScore 9.25
-G_M864_IG03: ; bbWeight=1, gcrefRegs=0009 {rax rbx}, byrefRegs=0000 {}, byref, isz
- inc edx
- cmp edx, ecx
- jge SHORT G_M864_IG07
- mov r8, rax
- ; gcrRegs +[r8]
- mov r8, gword ptr [r8+0x08]
- cmp edx, dword ptr [r8+0x08]
- jae SHORT G_M864_IG09
+ xor ecx, ecx
+ mov rdx, gword ptr [rax+0x08]
+ ; gcrRegs +[rdx]
+ mov r8d, dword ptr [rdx+0x10]
+ test r8d, r8d
+ jle SHORT G_M864_IG09
+ ;; size=21 bbWeight=1 PerfScore 10.50
+G_M864_IG03: ; bbWeight=1, gcrefRegs=000C {rdx rbx}, byrefRegs=0000 {}, byref, isz
+ ; gcrRegs -[rax]
mov r10, rdx
+ ; gcrRegs +[r10]
+ cmp ecx, r8d
+ jge SHORT G_M864_IG08
+ mov rax, gword ptr [r10+0x08]
+ ; gcrRegs +[rax]
+ cmp ecx, dword ptr [rax+0x08]
+ jae SHORT G_M864_IG11
+ mov r10, rcx
+ ; gcrRegs -[r10]
shl r10, 7
- lea r8, bword ptr [r8+r10+0x10]
- ; gcrRegs -[r8]
- ; byrRegs +[r8]
- movzx r8, byte ptr [r8+0x1C]
- ; byrRegs -[r8]
- lea r10d, [r8-0x01]
+ lea rax, bword ptr [rax+r10+0x10]
+ ; gcrRegs -[rax]
+ ; byrRegs +[rax]
+ movzx rax, byte ptr [rax+0x1C]
+ ; byrRegs -[rax]
+ lea r10d, [rax-0x01]
cmp r10d, 1
ja SHORT G_M864_IG06
- ;; size=46 bbWeight=1 PerfScore 13.25
+ ;; size=43 bbWeight=1 PerfScore 13.00
G_M864_IG04: ; bbWeight=1, gcrefRegs=0008 {rbx}, byrefRegs=0000 {}, byref
- ; gcrRegs -[rax]
+ ; gcrRegs -[rdx]
mov eax, 1
;; size=5 bbWeight=1 PerfScore 0.25
G_M864_IG05: ; bbWeight=1, epilog, nogc, extend
- add rsp, 32
+ add rsp, 48
pop rbx
ret
;; size=6 bbWeight=1 PerfScore 1.75
-G_M864_IG06: ; bbWeight=0.00, gcVars=0000000000000000 {}, gcrefRegs=0009 {rax rbx}, byrefRegs=0000 {}, gcvars, byref, isz
- ; gcrRegs +[rax]
- add r8d, -22
- cmp r8d, 1
- ja SHORT G_M864_IG03
- jmp SHORT G_M864_IG04
- ;; size=12 bbWeight=0.00 PerfScore 0.00
-G_M864_IG07: ; bbWeight=0, gcrefRegs=0008 {rbx}, byrefRegs=0000 {}, byref
- ; gcrRegs -[rax]
+G_M864_IG06: ; bbWeight=0.00, gcVars=0000000000000000 {}, gcrefRegs=000C {rdx rbx}, byrefRegs=0000 {}, gcvars, byref, isz
+ ; gcrRegs +[rdx]
+ add eax, -22
+ cmp eax, 1
+ jbe SHORT G_M864_IG04
+ ;; size=8 bbWeight=0.00 PerfScore 0.00
+G_M864_IG07: ; bbWeight=1, gcrefRegs=000C {rdx rbx}, byrefRegs=0000 {}, byref, isz
+ inc ecx
+ cmp ecx, r8d
+ jl SHORT G_M864_IG03
+ jmp SHORT G_M864_IG09
+ ;; size=9 bbWeight=1 PerfScore 3.50
+G_M864_IG08: ; bbWeight=0, gcrefRegs=0408 {rbx r10}, byrefRegs=0000 {}, byref
+ ; gcrRegs -[rdx] +[r10]
+ mov rcx, qword ptr [r10]
+ mov rax, 0xD1FFAB1E ; code for System.Collections.Immutable.ImmutableArray`1+Builder[Microsoft.CodeAnalysis.CSharp.MemberResolutionResult`1[System.__Canon]]:ThrowIndexOutOfRangeException()
+ call [rax]System.Collections.Immutable.ImmutableArray`1+Builder[Microsoft.CodeAnalysis.CSharp.MemberResolutionResult`1[System.__Canon]]:ThrowIndexOutOfRangeException()
+ ; gcrRegs -[r10]
+ ; gcr arg pop 0
+ int3
+ ;; size=16 bbWeight=0 PerfScore 0.00
+G_M864_IG09: ; bbWeight=0, gcrefRegs=0008 {rbx}, byrefRegs=0000 {}, byref
xor eax, eax
;; size=2 bbWeight=0 PerfScore 0.00
-G_M864_IG08: ; bbWeight=0, epilog, nogc, extend
- add rsp, 32
+G_M864_IG10: ; bbWeight=0, epilog, nogc, extend
+ add rsp, 48
pop rbx
ret
;; size=6 bbWeight=0 PerfScore 0.00
-G_M864_IG09: ; bbWeight=0, gcVars=0000000000000000 {}, gcrefRegs=0008 {rbx}, byrefRegs=0000 {}, gcvars, byref
+G_M864_IG11: ; bbWeight=0, gcVars=0000000000000000 {}, gcrefRegs=0008 {rbx}, byrefRegs=0000 {}, gcvars, byref
call CORINFO_HELP_RNGCHKFAIL
; gcr arg pop 0
int3
;; size=6 bbWeight=0 PerfScore 0.00
-; Total bytes of code 109, prolog size 8, PerfScore 26.00, instruction count 36, allocated bytes for code 109 (MethodHash=caadfc9f) for method Microsoft.CodeAnalysis.CSharp.OverloadResolutionResult`1[System.__Canon]:get_HasAnyApplicableMember():ubyte:this (Tier1)
+; Total bytes of code 135, prolog size 13, PerfScore 31.50, instruction count 45, allocated bytes for code 135 (MethodHash=caadfc9f) for method Microsoft.CodeAnalysis.CSharp.OverloadResolutionResult`1[System.__Canon]:get_HasAnyApplicableMember():ubyte:this (Tier1)
; ============================================================
Unwind Info:
@@ -124,5 +139,5 @@ Unwind Info:
FrameRegister : none (0)
FrameOffset : N/A (no FrameRegister) (Value=0)
UnwindCodes :
- CodeOffset: 0x05 UnwindOp: UWOP_ALLOC_SMALL (2) OpInfo: 3 * 8 + 8 = 32 = 0x20
+ CodeOffset: 0x05 UnwindOp: UWOP_ALLOC_SMALL (2) OpInfo: 5 * 8 + 8 = 48 = 0x30
CodeOffset: 0x01 UnwindOp: UWOP_PUSH_NONVOL (0) OpInfo: rbx (3) Looks like we remove a runtime lookup which makes loop inversion cheaper (since it no longer has to be duplicated). Loop inversion kicks in but that then breaks RBO's ability to remove a redundant branch. |
The large size regressions
are because of loop cloning. If I run diffs for win-x64 coreclr_tests.run and libraries_tests.run with loop cloning disabled I get these diffs. |
/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress |
Azure Pipelines successfully started running 2 pipeline(s). |
cc @dotnet/jit-contrib |
gtNodeHasSideEffects
is meant to check if a node has side effects when you exclude its children. However, it was checking arguments of calls which is more conservative than expected.The actual reason we were doing that seems to be
gtTreeHasSideEffects
that sometimes wants to ignoreGTF_CALL
on pure helper calls. It was relying on this check ingtNodeHasSideEffect
; instead move it togtTreeHasSideEffects
where it belongs.This is an alternative fix for #106129 (alternative to #106183); there we leave a
COMMA(CORINFO_HELP_RUNTIMELOOKUP_METHOD, ...)
around because extracting side effects from op1 does not end up getting rid of the call.Fix #106129